AITopics | lr decay

lr decay

Overleaf Example

Neural Information Processing Systems | Feb-9-2026, 21:15:32 GMT

Experiments show that the proposed ReBalanced Adversarial Training (ReBAT) can attain good robustness and does not suffer from robust overfitting even after very long training.

artificial intelligence, machine learning, non-robust feature, (17 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry:

Leisure & Entertainment (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.43)

32fcc8cfe1fa4c77b5c58dafd36d1a98-AuthorFeedback.pdf

Neural Information Processing Systems | Feb-8-2026, 00:49:08 GMT

accuracy, rate schedule, test accuracy, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Luo, Kairong, Sun, Zhenbo, Wen, Haodong, Shi, Xinyu, Cui, Jiarui, Dang, Chenyi, Lyu, Kaifeng, Chen, Wenguang

arXiv.org Artificial Intelligence | Nov-25-2025

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
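The two strategies named in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine shape, the function names `cosine_lr` and `average_checkpoints`, the `final_ratio` parameter, and the toy checkpoint format (a dict mapping parameter names to lists of floats) are all hypothetical.

```python
import math


def cosine_lr(step, total_steps, peak_lr, final_ratio):
    """Cosine decay from peak_lr down to final_ratio * peak_lr.

    A final_ratio close to 1.0 gives a "moderate" decay, where the
    final LR is only moderately smaller than the peak LR.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    floor = final_ratio * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


def average_checkpoints(checkpoints, weights=None):
    """Weighted average of the last few checkpoints.

    Each checkpoint is assumed to be a dict: parameter name -> list of floats.
    With weights=None, all checkpoints count equally.
    """
    if weights is None:
        weights = [1.0] * len(checkpoints)
    total = sum(weights)
    averaged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(size)
        ]
    return averaged


# Toy usage: decay to 10% of the peak LR, then average two checkpoints,
# weighting the later one three times as heavily.
lr_start = cosine_lr(0, 100, peak_lr=1.0, final_ratio=0.1)
lr_end = cosine_lr(100, 100, peak_lr=1.0, final_ratio=0.1)
merged = average_checkpoints(
    [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}], weights=[1.0, 3.0]
)
```

In a real training run the checkpoints would hold framework tensors rather than Python lists; the averaging logic is the same elementwise weighted mean.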

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.18903

Country:

Europe (1.00)
Asia (0.67)
North America > United States > Minnesota (0.28)
North America > United States > California (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Overleaf Example

Neural Information Processing Systems | Oct-8-2025, 10:12:15 GMT

Experiments show that the proposed ReBalanced Adversarial Training (ReBAT) can attain good robustness and does not suffer from robust overfitting even after very long training.

lr decay, non-robust feature, robustness, (15 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry:

Leisure & Entertainment (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

32fcc8cfe1fa4c77b5c58dafd36d1a98-AuthorFeedback.pdf

Neural Information Processing Systems | Oct-2-2025, 15:13:28 GMT

We thank the reviewers for their detailed comments. Please see our response below.
"... common implementation of weight decay [1] will usually multiply the amount of weight decay by the learning " The same holds in our setup: We have an
"How do different learning rate schedules affect the conclusion?": We address LR schedule questions below.
"It would be great if the authors can provide more experiments on ... AUTOL2" We ran additional experiments
"((1)) If I could have access to the test set... ". We reject the claim that our submission "violates the ethics of
"((2)) I have concerns on comparing AutoL2... ". Experiments with lr decay and AutoL2 are presented in the SM.
"((3)) The practically of the proposed work...
"... more insights on the relation between learning rate scheduler and AutoL2... " We address this point in the
"... the lambda update refractory period is not detailed ... " The refractory period lasts for
"It would be interesting to see on the same graph, training with learning rate scheduler ... " In the SM we have the
"In Figure 1a and 1b, how is the best test accuracy determined?... " In Figs.

artificial intelligence, machine learning, rate schedule, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Balance, Imbalance, and Rebalance: Understanding Robust Overfitting from a Minimax Game Perspective

Wang, Yifei, Li, Liangchen, Yang, Jiansheng, Lin, Zhouchen, Wang, Yisen

arXiv.org Machine Learning | Oct-30-2023

Adversarial Training (AT) has arguably become the state-of-the-art algorithm for extracting robust features. However, researchers have recently noticed that AT suffers from severe robust overfitting, particularly after learning rate (LR) decay. In this paper, we explain this phenomenon by viewing adversarial training as a dynamic minimax game between the model trainer and the attacker. Specifically, we analyze how LR decay breaks the balance of the minimax game by giving the trainer a stronger memorization ability, and show that this imbalance induces robust overfitting as a result of memorizing non-robust features. We validate this understanding with extensive experiments, and provide a holistic view of robust overfitting from the dynamics of both game players. This understanding further inspires us to alleviate robust overfitting by rebalancing the two players, either by regularizing the trainer's capacity or by improving the attack strength. Experiments show that the proposed ReBalanced Adversarial Training (ReBAT) can attain good robustness and does not suffer from robust overfitting even after very long training. Code is available at https://github.com/PKU-ML/ReBAT.
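For readers unfamiliar with the minimax view mentioned above, the standard adversarial training objective is usually written as below. The notation is an assumption and does not come from this listing: $\theta$ are the model parameters chosen by the trainer, $\delta$ is the perturbation chosen by the attacker within an $\ell_p$ ball of radius $\epsilon$, $f_\theta$ is the model, $\ell$ is the loss, and $\mathcal{D}$ is the data distribution.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
\Big[ \max_{\|\delta\|_p \le \epsilon} \ell\big(f_{\theta}(x + \delta),\, y\big) \Big]
```

The outer minimization corresponds to the trainer and the inner maximization to the attacker; the paper's specific analysis of how LR decay shifts this balance may use a different formulation.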

lr decay, non-robust feature, robustness, (15 more...)

arXiv.org Machine Learning

2310.19360

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

Why Do We Need Weight Decay in Modern Deep Learning?

Andriushchenko, Maksym, D'Angelo, Francesco, Varre, Aditya, Flammarion, Nicolas

arXiv.org Artificial Intelligence | Oct-6-2023

Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for underparameterized large language models trained with nearly online SGD, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss. Moreover, we show that weight decay also prevents sudden loss divergences for bfloat16 mixed-precision training, which is a crucial tool for LLM training. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way.
Weight decay serves to constrain the network capacity (Goodfellow et al., 2016) and acts as a mechanism for suppressing irrelevant weight components, aligning with the principles of Occam's razor (Krogh & Hertz, 1991). It is central in discussions on generalization bounds (Shalev-Shwartz & Ben-David, 2014), although a recent empirical study by Jiang et al. (2020) casts doubt on how well norm-based measures correlate with generalization for deep networks. Weight decay is also known to yield a regularization of the input-output Jacobian (Zhang et al., 2018) and to alter the training dynamics of scale-invariant networks by changing the effective learning rate (Van Laarhoven, 2017).
Weight decay is widely used for training most state-of-the-art deep networks such as GPT-3 (Brown et al., 2020), CLIP (Radford et al., 2021), or PaLM (Chowdhery et al., 2022). We argue that despite its widespread usage, its effect is still poorly understood: in some cases it acts as a regularizer, but in others as a tool for better optimization. Although the regularization effect of weight decay is thoroughly studied in classical learning theory, deep networks are already equipped with strong implicit regularization coming from the parameter initialization, optimization algorithm, and architecture (Zhang et al., 2016). Moreover, recent years have brought along new architectures and settings such as transformers (Vaswani et al., 2017) and nearly one-epoch language modelling (Brown et al., 2020; Hoffmann et al., 2022).
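As a rough illustration of how weight decay interacts with the learning rate, the sketch below applies one plain SGD step with a coupled, L2-style decay term. The function name `sgd_step` and the toy values are hypothetical, and this coupled form is only one common variant, not necessarily the setup studied in the paper.

```python
def sgd_step(weights, grads, lr, weight_decay):
    """One SGD step with coupled (L2-style) weight decay.

    Update: w <- w - lr * (g + weight_decay * w), which is the same as
    w * (1 - lr * weight_decay) - lr * g. The amount of shrinkage per
    step therefore scales with the product lr * weight_decay.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


# With a zero gradient, only the decay term acts on the weight.
w_high_lr = sgd_step([1.0], [0.0], lr=0.1, weight_decay=0.5)
w_low_lr = sgd_step([1.0], [0.0], lr=0.01, weight_decay=0.5)
```

Because the decay term is multiplied by the learning rate here, decaying the learning rate also weakens the per-step shrinkage, which is one reason LR schedules and weight decay are hard to reason about separately.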

decay, regularization, weight decay, (14 more...)

arXiv.org Artificial Intelligence

2310.04415

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
